Homework 1
Load Packages
library(Hmisc)
library(tidyverse)
Problem 1
Survey
Wednesday Sept. 4 at 1:43pm
Campuswire
Insert the image you uploaded to Campuswire here.
Problem 2
Question 1
The study population for data set 1 is people aged 16 and older who do not live in communal residences. This population is broad, so a wide variety of people could be surveyed (men and women, older and younger people, etc.). The study population for data set 2 is police officers and their crime records, which is more specific and narrow.
Question 2
The sampling strategy for data set 1 is voluntary response sampling. The sampling strategy for data set 2 is convenience sampling because the study uses files that already exist.
Question 3
The sampled population for data set 1 is the 38,000 people aged 16 and older who do not live in communal residences. The sampled population for data set 2 is the prewritten files.
Question 4
The target population of the study is UK residents.
Question 5
Data set 1 is only somewhat reliable because it is self-reported, and people may not be completely honest in their answers. Data set 2 is reliable because it consists of the crime reports created by police officers. The validity of data set 1 is good because it covers a large number of people across a wide age range. The validity of data set 2 is good because it is drawn from police records. For data set 1, I think it is generalizable because it covers a large number of people across a wide age range. For data set 2, I think it is generalizable because it consists of prewritten records that are known to be true.
Problem 3
Question 1
The <- operator assigns a value to a name. For a top-level assignment like the one in this problem, it behaves the same as an = sign. After running this code chunk, the data frame df appears in the Environment pane of RStudio (on the right-hand side in the default layout).
df <- read_csv('https://www.openintro.org/data/csv/babies.csv')
Rows: 1236 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): case, bwt, gestation, parity, age, height, weight, smoke
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
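The note in Question 1 about <- and = can be checked with a short sketch. This uses only base R, and the names a, b, and s are made up for illustration; they are not part of the homework.

```r
# Two top-level assignments: one with <-, one with =.
a <- 10
b = 10

# Both names now hold the same value.
same_value <- identical(a, b)

# Inside a function call, = names an argument instead of assigning.
# Here `from` and `to` are arguments of seq(), not new variables.
s <- seq(from = 1, to = 5)
```

Because = has this second role inside function calls, using <- for assignment helps keep the two uses visually distinct.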
Question 2
The notation Hmisc:: directly calls this function from the Hmisc package. describe() is a common function name, and sometimes this is needed to indicate to R which function from which package you want to use. The pipe feature |> sends the results of the first line directly into the function on the 2nd line and is a convenient way to chain functions together.
This code prints a useful and attractive summary of the data set we are using.
Hmisc::describe(df) |>
  html()
8 Variables 1236 Observations
case
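| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1236 | 0 | 1236 | 1 | 618.5 | 412.3 | 62.75 | 124.50 | 309.75 | 618.50 | 927.25 | 1112.50 | 1174.25 |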
lowest : 1 2 3 4 5 , highest: 1232 1233 1234 1235 1236
bwt
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1236 | 0 | 107 | 1 | 119.6 | 20.33 | 88.0 | 97.0 | 108.8 | 120.0 | 131.0 | 142.0 | 149.0 |
gestation
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1223 | 13 | 106 | 0.999 | 279.3 | 16.57 | 252.0 | 262.0 | 272.0 | 280.0 | 288.0 | 295.8 | 302.0 |
parity
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 1236 | 0 | 2 | 0.57 | 315 | 0.2549 | 0.3801 |
age
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1234 | 2 | 30 | 0.997 | 27.26 | 6.506 | 19 | 20 | 23 | 26 | 31 | 36 | 38 |
height
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1214 | 22 | 19 | 0.986 | 64.05 | 2.839 | 60 | 61 | 62 | 64 | 66 | 67 | 68 |
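| Value | Frequency | Proportion |
|---|---|---|
| 53 | 1 | 0.001 |
| 54 | 1 | 0.001 |
| 56 | 1 | 0.001 |
| 57 | 1 | 0.001 |
| 58 | 10 | 0.008 |
| 59 | 26 | 0.021 |
| 60 | 55 | 0.045 |
| 61 | 105 | 0.086 |
| 62 | 131 | 0.108 |
| 63 | 166 | 0.137 |
| 64 | 183 | 0.151 |
| 65 | 182 | 0.150 |
| 66 | 153 | 0.126 |
| 67 | 105 | 0.086 |
| 68 | 54 | 0.044 |
| 69 | 20 | 0.016 |
| 70 | 13 | 0.011 |
| 71 | 6 | 0.005 |
| 72 | 1 | 0.001 |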
For the frequency table, variable is rounded to the nearest 0
weight
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1200 | 36 | 105 | 0.999 | 128.6 | 22.39 | 102.0 | 105.0 | 114.8 | 125.0 | 139.0 | 155.0 | 170.0 |
smoke
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 1226 | 10 | 2 | 0.717 | 484 | 0.3948 | 0.4782 |
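The :: and |> notation described in Question 2 can be tried on a small example. The sketch below uses only base R (the native pipe needs R 4.1 or later), and the vector x is made up for illustration.

```r
# A small made-up vector, only for illustration.
x <- c(4, 9, 16)

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(x)), 2)

# The same steps with the pipe are read top to bottom:
# each result becomes the first argument of the next function.
piped <- x |>
  sqrt() |>
  mean() |>
  round(2)

# pkg::fun() names the package explicitly; median() lives in stats.
explicit <- stats::median(x)
implicit <- median(x)
```

Here nested and piped hold the same value, and so do explicit and implicit. The explicit form matters mainly when two loaded packages define a function with the same name.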
Question 3
The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The variables in this data set are as follows.
| Variable Name | Variable Description | Variable Type |
|---|---|---|
| case | id number | categorical, multi-categorical |
| bwt | birthweight, in ounces | numerical |
| gestation | length of gestation, in days | numerical |
| parity | binary indicator for a first pregnancy (0 = first pregnancy) | categorical, binary |
| age | mother’s age in years | numerical |
| height | mother’s height in inches | numerical |
| weight | mother’s weight in pounds | numerical |
| smoke | binary indicator for whether the mother smokes | categorical, binary |
Question 4
Below, 2 numeric variables were investigated for potential relationships. The independent, explanatory variable I chose is gestation, and the dependent, response variable I chose is bwt.
df |>
  ggplot(aes(x = gestation, y = bwt)) +
  geom_point()
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).
Describe what you see in your plot here.
Most gestation periods fall between 250 and 300 days, and most birthweights fall between 100 and 150 ounces.
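One way to put a number on the pattern in a scatterplot is a correlation coefficient computed on the complete rows only. The sketch below is a minimal illustration that assumes a small made-up data frame called toy (not the babies data) so it runs without a download. It uses base R's complete.cases() and cor().

```r
# Hypothetical toy data, NOT the babies data set; NA marks a missing value.
toy <- data.frame(
  gestation = c(270, 284, NA, 282, 290, 260),
  bwt       = c(110, 120, 115, 128, 135, 100)
)

# Count the rows that a scatterplot would drop because a value is missing.
n_dropped <- sum(!complete.cases(toy[, c("gestation", "bwt")]))

# Correlation using only rows where both values are present.
r <- cor(toy$gestation, toy$bwt, use = "complete.obs")
```

The same call on the report's data would be cor(df$gestation, df$bwt, use = "complete.obs"). The 13 rows removed in the warning match the 13 missing gestation values in the Question 2 summary, where bwt has none missing.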
Session Info
This portion of the document describes the conditions in RStudio under which this report was created. This is important to include so that work is reproducible by others.
xfun::session_info()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
askpass_1.2.0 backports_1.5.0 base64enc_0.1-3
bit_4.0.5 bit64_4.0.5 blob_1.2.4
broom_1.0.6 bslib_0.8.0 cachem_1.1.0
callr_3.7.6 cellranger_1.1.0 checkmate_2.3.2
cli_3.6.3 clipr_0.8.0 cluster_2.1.6
colorspace_2.1-1 compiler_4.4.1 conflicted_1.2.0
cpp11_0.4.7 crayon_1.5.3 curl_5.2.1
data.table_1.15.4 DBI_1.2.3 dbplyr_2.5.0
digest_0.6.37 dplyr_1.1.4 dtplyr_1.3.1
evaluate_0.24.0 fansi_1.0.6 farver_2.1.2
fastmap_1.2.0 fontawesome_0.5.2 forcats_1.0.0
foreign_0.8-86 Formula_1.2-5 fs_1.6.4
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.1
glue_1.7.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.4.1 grDevices_4.4.1 grid_4.4.1
gridExtra_2.3 gtable_0.3.5 haven_2.5.4
highr_0.11 Hmisc_5.1-3 hms_1.1.3
htmlTable_2.4.3 htmltools_0.5.8.1 htmlwidgets_1.6.4
httr_1.4.7 ids_1.0.1 isoband_0.2.7
jquerylib_0.1.4 jsonlite_1.8.8 knitr_1.48
labeling_0.4.3 lattice_0.22.6 lifecycle_1.0.4
lubridate_1.9.3 magrittr_2.0.3 MASS_7.3.60.2
Matrix_1.7.0 memoise_2.0.1 methods_4.4.1
mgcv_1.9.1 mime_0.12 modelr_0.1.11
munsell_0.5.1 nlme_3.1.164 nnet_7.3-19
openssl_2.2.1 parallel_4.4.1 pillar_1.9.0
pkgconfig_2.0.3 prettyunits_1.2.0 processx_3.8.4
progress_1.2.3 ps_1.7.7 purrr_1.0.2
R6_2.5.1 ragg_1.3.2 rappdirs_0.3.3
RColorBrewer_1.1.3 readr_2.1.5 readxl_1.4.3
rematch_2.0.0 rematch2_2.1.2 reprex_2.1.1
rlang_1.1.4 rmarkdown_2.28 rpart_4.1.23
rstudioapi_0.16.0 rvest_1.0.4 sass_0.4.9
scales_1.3.0 selectr_0.4.2 splines_4.4.1
stats_4.4.1 stringi_1.8.4 stringr_1.5.1
sys_3.4.2 systemfonts_1.1.0 textshaping_0.4.0
tibble_3.2.1 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.52
tools_4.4.1 tzdb_0.4.0 utf8_1.2.4
utils_4.4.1 uuid_1.2.1 vctrs_0.6.5
viridis_0.6.5 viridisLite_0.4.2 vroom_1.6.5
withr_3.0.1 xfun_0.47 xml2_1.3.6
yaml_2.3.10